Data for this HW is stored at /n/stat115/2021/HW6/data

Part I: Data exploration on TCGA

The Cancer Genome Atlas (TCGA) is an NCI project to comprehensively profile over 10K tumors in 33 cancer types. In this homework, we are going to explore TCGA data analysis.

Q1. Go to TCGA GDC website (https://portal.gdc.cancer.gov/) and explore the GDC data portal. How many glioblastoma (GBM) cases in TCGA meet ALL of the following requirements?

  1. Male;

  2. Diagnosed at the age above 45;

  3. Still alive.

Answer: One the TCGA GDC website, I found 42 cases met all the requirements. Below are the screenshots of the filters I applied.

Site Specific Filters

Demographic Specific Filters

Q2. TCGA GDC (https://portal.gdc.cancer.gov/) and Broad Firehose (http://firebrowse.org/) both provide processed TCGA data for downloading and downstream analysis. Download clinical data of GBM. What’s the average diagnosed age of all GBM patients?

Answer: The abbreviation for glioblastoma is GBM. The average diagnosed age of all GBM patients is 57.81681 years.

# get abbreviations
cohorts = Metadata.Cohorts(format = "csv")
head(cohorts, 15)
##      cohort                                                      description
## 1       ACC                                         Adrenocortical carcinoma
## 2      BLCA                                     Bladder Urothelial Carcinoma
## 3      BRCA                                        Breast invasive carcinoma
## 4      CESC Cervical squamous cell carcinoma and endocervical adenocarcinoma
## 5      CHOL                                               Cholangiocarcinoma
## 6      COAD                                             Colon adenocarcinoma
## 7  COADREAD                                        Colorectal adenocarcinoma
## 8      DLBC                  Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
## 9      ESCA                                            Esophageal carcinoma 
## 10     FPPP                                              FFPE Pilot Phase II
## 11      GBM                                          Glioblastoma multiforme
## 12   GBMLGG                                                           Glioma
## 13     HNSC                            Head and Neck squamous cell carcinoma
## 14     KICH                                               Kidney Chromophobe
## 15    KIPAN                               Pan-kidney cohort (KICH+KIRC+KIRP)
# Download GBM data (have to do this because we can only receive 150 patients at a time)
all.Received <- FALSE
page.Counter <- 1
page.size <- 150
gbm_pats <- list()
while(all.Received == FALSE) {
    gbm_pats[[page.Counter]] <- Samples.Clinical(format = "csv",
            cohort = "GBM", page_size = page.size, page = page.Counter)
    if(page.Counter > 1) {
        colnames(gbm_pats[[page.Counter]]) <-
            colnames(gbm_pats[[page.Counter-1]])
    }

    if(nrow(gbm_pats[[page.Counter]]) < page.size) {
        all.Received = TRUE
    } else {
        page.Counter = page.Counter + 1
    }
}

gbm_pats <- do.call(rbind, gbm_pats)

#Dimension of the table containing patient info
dim(gbm_pats)
## [1] 595  55
# get average age of diagnosis
mean(gbm_pats$age_at_initial_pathologic_diagnosis)
## [1] 57.81681

Part II – Tumor Subtypes

Q1. GBM is one of the earliest cancer types to be processed by TCGA, and the expression profiling was initially done with Affymetrix microarray. Also, with brain cancer, it is hard to get sufficient number of normal samples. We provide the pre-processed expression matrix in (GBM_expr.txt) where samples are columns and genes are rows. Do a K-means (k=3) clustering from all the genes and the most variable 2000 genes. Do tumor and normal samples separate in different clusters? Do the tumors samples consistently separate into 2 clusters, regardless of whether you use all the genes or most variable genes?

Answer: I couldn’t find the GBM_expr.txt file on cannon so I instead pulled the file from the course’s github from the 2020 HW repo. I posted a question on Slack about this however I received no response, so I’ll do the analysis with this data. Note the code would be the same if we provided another file.

The file provides 60 tumor samples and 10 normal samples. The data is already log2-transformed. Samples that start with Normal are normal samples while those that start with TCGA will be the tumor samples. I’ll do K-means (k = 3) with all genes and top 2000 variable genes with the recommended parameters of nstart = 10 and iter.max = 100

# get the file (missing from cannon)
scp stat215u2107@login.rc.fas.harvard.edu:/n/stat115/2021/HW6/data/GBM_expr.txt part2/GBM_expr.txt
# your code here
gbm_expr = read.csv("part2/GBM_expr.txt", sep = "\t")
# K-means for all genes
set.seed(215)
gbm_expr_all = t(gbm_expr[, -1])
k3 = kmeans(gbm_expr_all, centers = 3, nstart = 10, iter.max = 100)
fviz_cluster(k3, data = gbm_expr_all, geom = "point")

table(k3$cluster)
## 
##  1  2  3 
## 30 10 30

The above K-means separates the normal and tumor samples based on all genes. The normal samples are the green cluster (cluster 2) with 10 samples. We also see two separate clusters (red cluster 1 and blue cluster 3) for the tumor samples each with 30 samples (however there is some small overlap). I check what samples got placed in which clusters by exploring the k3 object.

# find the top 2000 variable genes
gbm_expr_top2k = gbm_expr[, -1] # make copy with numbers
gbm_expr_top2k$var = rowVars(data.matrix(gbm_expr_top2k)) # find variances
gbm_expr_top2k = gbm_expr_top2k[order(-gbm_expr_top2k$var), ] # sort genes
gbm_expr_top2k = head(gbm_expr_top2k, 2000) # get top 2000 variable genes
gbm_expr_top2k = gbm_expr_top2k[, !names(gbm_expr_top2k) %in% c("var")] # drop the variance column
# K-means for top 2000 variable genes
set.seed(215)
gbm_expr_top2k_kmeans = t(gbm_expr_top2k)
k3 = kmeans(gbm_expr_top2k_kmeans, centers = 3, nstart = 10, iter.max = 100)
fviz_cluster(k3, data = gbm_expr_top2k_kmeans, geom = "point")

table(k3$cluster)
## 
##  1  2  3 
## 30 10 30

The above K-means separates the normal and tumor samples based on top 2000 variable genes. The normal samples are the green cluster (cluster 2) with 10 samples. We also see two separate clusters (red cluster 1 and blue cluster 3) for the tumor samples each with 30 samples.

Therefore, using top 2000 variable genes and all genes results in nice cluster separation - there’s clear division among the normal, and two types of tumor samples. However the K-means graph with top 2000 variable genes looks better than the K-means graph with all genes as there are no overlaps in the clusters and the PCs explain more of the variance. Hence, we’ll stick with using top 2000 variable gene for future analysis.

Q2. LIMMA is a BioConductor package that does differential expression between microarrays, RNA-seq, and can remove batch effects (especially if you have experimental design with complex batches). Use LIMMA to see how many genes are differentially expressed between the two GBM subtypes (with FDR < 0.05 and logFC > 1.5)?

Answer: We are interested in the two GBM subtypes - I’ll use the two clusters corresponding to the tumor samples only from Q1 using only the 2000 top variable genes - those are namely clusters 1 and 3. We will need make a design matrix. After that, we did LIMMA analysis. Using ebayes and topTable (which is specified with the criteria FDR < 0.05 and logFC > 1.5 - we can view adjusted p-value as equivalent to FDR in topTable), I got 385 genes are differentially expressed between the two GBM subtypes.

# remaking the top 2K variable genes with proper row names
gbm_expr_top2k = gbm_expr 
rownames(gbm_expr_top2k) = gbm_expr$X # set rownames as genes
gbm_expr_top2k = gbm_expr_top2k[, -1] # drop gene name row
gbm_expr_top2k$var = rowVars(data.matrix(gbm_expr_top2k)) # find variances
gbm_expr_top2k = gbm_expr_top2k[order(-gbm_expr_top2k$var), ] # sort genes
gbm_expr_top2k = head(gbm_expr_top2k, 2000) # get top 2000 variable genes
gbm_expr_top2k = gbm_expr_top2k[, !names(gbm_expr_top2k) %in% c("var")] # drop the variance column
# make the design matrix
tumor_clusters = data.frame(k3$cluster[c(1:60)]) # get the tumor clusters
tumor_clusters$type = 1*(tumor_clusters$k3.cluster.c.1.60.. == 1) # label the types
design_matrix = model.matrix(~ type, tumor_clusters) # get design matrix

# LIMMA analysis
lmfit_expr = lmFit(gbm_expr_top2k[, c(1:60)], design_matrix)
lmfit_expr = eBayes(lmfit_expr)
dep_expr = topTable(lmfit_expr, coef = 2, p.value = 0.05, lfc = 1.5, number = 2000)
head(dep_expr, 10)
##             logFC   AveExpr         t      P.Value    adj.P.Val        B
## LGALS3   3.709516 10.222698  16.68256 3.974742e-25 7.949483e-22 46.79734
## DLL3    -3.575501  6.741114 -14.60857 3.619115e-22 3.613736e-19 40.10827
## ANXA2    2.597845 11.048910  14.49053 5.420605e-22 3.613736e-19 39.71069
## TIMP1    3.447083 10.468646  14.16414 1.671846e-21 6.804336e-19 38.60167
## CXXC4   -2.251179  5.903109 -14.15915 1.701084e-21 6.804336e-19 38.58460
## EFEMP2   2.830607  6.896656  13.97657 3.213519e-21 1.071173e-18 37.95793
## ZNF804A -2.678269  5.819445 -13.61073 1.164375e-20 3.326785e-18 36.68894
## TAGLN2   2.268154  7.584229  13.41395 2.343760e-20 5.859400e-18 35.99901
## RBP1     3.503199  6.930935  13.07462 7.922769e-20 1.760615e-17 34.79724
## DYNLT3   2.699264  8.763837  12.90525 1.463142e-19 2.926285e-17 34.19172
nrow(dep_expr)
## [1] 385

Q3. For Graduate Students: From the DNA methylation profiles (GBM_meth.txt), how many genes are significantly differentially methylated between the two subtypes? Are DNA methylation associated with higher or lower expression of these genes? How many differentially expressed genes have an epigenetic (DNA methylation) cause?

Answer: First we get the DNA methylation profiles from Cannon. Not all samples have methylation profiles, so we need to redesign the design matrix by focusing on the samples we have in DNA methylation profiles. A quick inspection shows us that the samples in DNA methylation profiles are all tumor samples that correspond by the sample label from the previous questions. I perform this matching. We also need to logit transform the values since the methylation values is in range 0 to 1 before running limma. I’ll also filter the DNA methylation profiles to the top 2000 variable genes similar to how we focused on top 2000 variable genes in the expression data.

Using ebayes and topTable (which is specified with the criteria FDR < 0.05 and logFC > 1.5), I got 735 genes are differentially methylated between the two GBM subtypes.

To compare with methylation with expression, I will use intersect(), then plot methylation logFC against expression logFC, and calculate the association. I got that 62 genes are differentially methylated and differentially expressed. The resulting plot below shows a negative association between the two, and the resulting correlation is -0.4954892. This indicates that high methylation is associated with low gene expression which aligns with our in class discussion on methylation and gene expression.

To determine the number of differentially expressed genes that have an epigenetic (DNA methylation) cause, I’ll look at the genes I get from intersect() and find those that have opposite differential expression and methylation directions (according to lab 10.2). I found 60 such genes.

# get the file 
scp stat215u2107@login.rc.fas.harvard.edu:/n/stat115/2021/HW6/data/GBM_meth.txt part2/GBM_meth.txt
# load in the data
gbm_meth = read.csv("part2/GBM_meth.txt", sep = "\t")
rownames(gbm_meth) = gbm_meth$X # set rownames as genes
gbm_meth = gbm_meth[, -1] # drop gene name row

# filter for top 2000 variable genes
gbm_meth$var = rowVars(data.matrix(gbm_meth)) # find variances
gbm_meth = gbm_meth[order(-gbm_meth$var), ] # sort genes
gbm_meth_top2k = head(gbm_meth, 2000) # get top 2000 variable genes
gbm_meth_top2k = gbm_meth_top2k[, !names(gbm_meth_top2k) %in% c("var")] # drop the variance column

gbm_meth_top2k = log(gbm_meth_top2k/(1 - gbm_meth_top2k)) # performing logit
# subset the design_matrix
meth_design_matrix = subset(design_matrix, rownames(design_matrix) %in% colnames(gbm_meth_top2k))

# LIMMA analysis
lmfit_meth = lmFit(gbm_meth_top2k, meth_design_matrix)
lmfit_meth = eBayes(lmfit_meth)
dep_meth = topTable(lmfit_meth, coef = 2, p.value = 0.05, lfc = 1.5, number = 2000)
head(dep_meth, 10)
##             logFC     AveExpr         t      P.Value    adj.P.Val        B
## FAS     -4.715015 -1.34396804 -14.71830 1.011289e-17 2.022578e-14 30.06113
## LGALS3  -5.039901 -0.91790723 -13.25543 3.301280e-16 2.565430e-13 26.71457
## MGST2   -4.946036 -0.62101747 -13.15154 4.268687e-16 2.565430e-13 26.46666
## KLF16   -5.093527 -0.43627070 -13.07749 5.130859e-16 2.565430e-13 26.28910
## PDE8A   -3.631714 -1.85922565 -12.78960 1.055703e-15 3.776033e-13 25.59200
## C9orf47 -4.719457  0.01980636 -12.69144 1.353294e-15 3.776033e-13 25.35181
## S1PR3   -4.719457  0.01980636 -12.69144 1.353294e-15 3.776033e-13 25.35181
## B3GNT5  -4.718897 -0.63151373 -12.64816 1.510413e-15 3.776033e-13 25.24553
## ADAM8   -4.862003 -0.43499471 -12.40208 2.833256e-15 6.296124e-13 24.63640
## SSH3    -3.969135 -0.72598362 -12.35709 3.181172e-15 6.362345e-13 24.52416
nrow(dep_meth)
## [1] 735
# compare with methylation with expression
gene_intersect = intersect(rownames(dep_expr), rownames(dep_meth)) # find common genes
length(gene_intersect) # find number of common genes
## [1] 62
# construct a dataframe of common genes methylation logFC and expression logFC
common_gene = data.frame(row.names = gene_intersect,
                         exprLFC = dep_expr[gene_intersect, ]$logFC,
                         methLFC = dep_meth[gene_intersect, ]$logFC)

# make plots of methylation logFC vs expression logFC
plot(common_gene$exprLFC, common_gene$methLFC,
     main = "Methylation logFC vs Expression logFC",
     xlab = "Expression LFC",
     ylab = "Methylation LFC")

# association
cor(common_gene$exprLFC, common_gene$methLFC)
## [1] -0.4954892
# determine the number of differentially expressed genes that have an epigenetic (DNA methylation) cause
same_direction = function(a,b) {
  ifelse(a == 0 | b == 0, as.logical("FALSE"), !xor(sign(a)+1,sign(b)+1)) 
}

sum(!same_direction(common_gene$exprLFC, common_gene$methLFC))
## [1] 60

Q4. With the survival data of the GBM tumors (GBM_clin.txt), make a Kaplan-Meier Curve to compare the two subtypes of GBM patients. Is there a significant difference in patient outcome between the two subtypes?

Answer: First I get the file from Cannon. Then I format the data (since the names are slightly different - uses dashes instead of periods). Finally I make the Kaplan-Meier Curve to compare the two subtypes of GBM patients. I notice that there a significant difference in patient outcome between the two subtypes because the gap between the two curve (type 0 has higher survival). The logrank test provides a p-value of 9e-04 which indicates the survival curves are significantly different across the two levels of covariates.

# get the file 
scp stat215u2107@login.rc.fas.harvard.edu:/n/stat115/2021/HW6/data/GBM_clin.txt part2/GBM_clin.txt
# load data
gbm_clin = read.csv("part2/GBM_clin.txt", sep = "\t")

# format the data
gbm_clin$X = str_replace_all(gbm_clin$X, "-", ".")
rownames(gbm_clin) = gbm_clin$X

# add in subtype information
kmc_data = merge(gbm_clin, tumor_clusters, by = 0)
# Make the Kaplan-Meier Curve
kmc_fit = survfit(Surv(days.to.death, vital.status) ~ type, data = kmc_data)
ggsurvplot(kmc_fit, pval = TRUE, pval.method = TRUE) +
    ggtitle("Survival for GBM Subtypes")

Q5. For Graduate Students: Use the differential genes (say this is Y number of genes) between the two GBM subtypes as a gene signature to do a Cox regression of the tumor samples. Does it give significant predictive power of patient outcome?

Answer: The cox regression models the hazard ratio against a set of covariates (in this case the differential genes). We will use the differentially expressed genes (as instructed from slack) as the predictors. This requires formatting our data table accordingly. Additionally, we need to perform lasso regression to identify the genes we should include since we have 385 differentially expressed genes compared to 60 samples. We perform the analysis provided in lab. I found the follow genes had non-zero coefficients in the LASSO procedure: FCGR2B, PPCS, C1S, IL8, and TNC.

Thus I’ll perform Cox regression on the survival object against cluster type, and genes FCGR2B, PPCS, C1S, IL8, and TNC (results below). From the summary output, I see none of the predictors are statistically significant at \(p < 0.05\). This is different from our results in Q4 where the type was statistically significant. However, both this model and the Kaplan-Meier Curve agree on the direction of effect for cluster type. We saw in the curve above, cluster 1 had lower survivor probability and in this cox model, we see a negative coefficient for type which agrees with the curve.

The cox model demonstrates significant predictive power of patient outcome, as the Likelihood ratio test, Wald test, and Score (logrank) test all attain significant p-values.

# make the dataframe
cox_data = data.frame(row.names = kmc_data$X,
                      "time" = kmc_data$days.to.death,
                      "death" = kmc_data$vital.status,
                      "type" = kmc_data$type)

# get the gene signatures
covariates = data.frame(t(gbm_expr_top2k[rownames(dep_expr), c(1:60)]))

# format data frame
cox_data = merge(cox_data, covariates, by = 0)
cox_data_nona = na.omit(cox_data)
# lasso analysis
set.seed(115)
x = as.matrix(cox_data_nona[,-c(1:4)]) 
survobj = Surv(cox_data_nona$time, cox_data_nona$death)
cvfit = cv.glmnet(x, survobj, family = "cox", alpha = 1) 
cvfit
## 
## Call:  cv.glmnet(x = x, y = survobj, family = "cox", alpha = 1) 
## 
## Measure: Partial Likelihood Deviance 
## 
##     Lambda Index Measure      SE Nonzero
## min 0.4091    10   7.352 0.11793       5
## 1se 0.6218     1   7.446 0.00622       0
# get non-zero coefficients
rownames(coef(cvfit, s = 'lambda.min'))[coef(cvfit, s = 'lambda.min')[,1]!= 0]
## [1] "FCGR2B" "PPCS"   "C1S"    "IL8"    "TNC"
# cox regression
coxreg = coxph(Surv(time, death) ~ type + PPCS + C1S + TNC, data = cox_data)
summary(coxreg)
## Call:
## coxph(formula = Surv(time, death) ~ type + PPCS + C1S + TNC, 
##     data = cox_data)
## 
##   n= 43, number of events= 43 
##    (17 observations deleted due to missingness)
## 
##         coef exp(coef) se(coef)      z Pr(>|z|)
## type -0.6806    0.5063   0.6334 -1.075    0.283
## PPCS  0.3890    1.4755   0.3129  1.243    0.214
## C1S   0.2571    1.2932   0.1778  1.446    0.148
## TNC   0.2622    1.2998   0.1740  1.507    0.132
## 
##      exp(coef) exp(-coef) lower .95 upper .95
## type    0.5063     1.9751    0.1463     1.752
## PPCS    1.4755     0.6777    0.7991     2.724
## C1S     1.2932     0.7733    0.9127     1.832
## TNC     1.2998     0.7693    0.9242     1.828
## 
## Concordance= 0.709  (se = 0.041 )
## Likelihood ratio test= 23.5  on 4 df,   p=1e-04
## Wald test            = 17.36  on 4 df,   p=0.002
## Score (logrank) test = 19.77  on 4 df,   p=6e-04

Q6. For Graduate Students: Many studies use gene signatures to predict prognosis of patients. Take a look at this paper: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002240. It turns out that most published gene signatures are not significantly more associated with outcome than random predictors. Write a script to randomly sample Y genes in this expression data as a gene signature and do Cox regression on the sampled signature to predict patient outcome. Automate the script and random sample followed by Cox regression 100 times. How does your signature in Q5 compared to random signatures in predicting outcome?

Answer: The script to randomly sample genes in this expression data as a gene signature and do Cox regression on the sampled signature to predict patient outcome is provided below. In Q4, we found from our LASSO analysis that 5 genes had non-zero coefficients. This will be the number of genes we randomly sample. Moreover, we’ll sample from the entire set of genes as opposed to restricting ourselves to the top 2000 variable genes. The criterion to evaluate model performance will be AIC. Lower AIC indicate a better model!

# baseline AIC performance
extractAIC(coxreg)
## [1]   4.0000 227.5671
set.seed(215) # get consistent results
k = 3 # number of genes to sample from Q4
iter = 100 # number of iterations

# data process of samples and genes 
gbm_expr_samples = gbm_expr
rownames(gbm_expr_samples) = gbm_expr_samples$X
gbm_expr_samples = data.frame(t(gbm_expr_samples[, -1]))[c(1:60), ]

# keep track of our results
sims = matrix(, nrow = iter, ncol = k + 1)

for (i in 1:iter) {
  # make our dataframe
  cox_data_iter = data.frame(row.names = kmc_data$X,
                      "time" = kmc_data$days.to.death,
                      "death" = kmc_data$vital.status,
                      "type" = kmc_data$type)
  
  # get random gene
  genes_iter = sample(colnames(gbm_expr_samples), 3)
  
  # update cox data 
  cox_data_iter = merge(cox_data_iter, 
                        gbm_expr_samples[,genes_iter],
                        by = 0)[, -1]
  
  # compute model
  coxreg_iter = coxph(Surv(time, death) ~ ., data = cox_data_iter)
  
  # store results
  sims[i, 1:k] = genes_iter
  sims[i, k+1] = extractAIC(coxreg_iter)[2]
}
# Visualization and Analysis of the results
sims = data.frame(sims)
colnames(sims) = c("Gene1", "Gene2", "Gene3", "AIC")
sims$AIC = as.numeric(sims$AIC)

# histogram
hist(sims$AIC, xlab = "AIC",
     main = "Cox Models with Random Genes")
abline(v = extractAIC(coxreg)[2], col = "red", lwd = 3, lty = 2)

# print genes with best AICs from the random process
head(sims[order(sims$AIC),], 5)
##    Gene1 Gene2   Gene3      AIC
## 8   RAE1  CD5L ALOX15B 227.6921
## 73  GALC  RIT1  ADAM17 230.9436
## 35 DNAH2  PKM2    BAP1 231.2094
## 68 TOP3B  UGT8   KNTC1 231.9484
## 50 POMT2  FZD4  CDK5R2 232.6234

The baseline AIC from the cox model in Q4 was 227.5671. I performed the process 100 times, storing the genes that were used and the AIC. I made a histogram of from the resulting AICs - the red line is our base AIC from Q4. We see the cox model with differentially expressed genes chosen by LASSO performed much better using the AIC as criteria. Looking at the best cox models from the simulations, our original cox model from Q4 beat. The best cox model from random genes was with genes RAE1, CD5L, and ALOX15B acheiving an AIC of 227.6921 (being roughly .13 greater).

My results are the opposite of the paper’s finding that random gene selection works better. Thus it seems like there are some genes that clearly are more relevant for the cox model. Another reason for my results are perhaps that I randomly sampled from all genes (so 12042 genes) instead of the subset of top 2000 variable genes. Let’s rerun the simulation and see the results when I sample from the top 2000 genes.

# Same analysis but with top 2000 variable genes
set.seed(215) # get consistent results
k = 3 # number of genes to sample from Q4
iter = 100 # number of iterations

# data process of samples and genes 
gbm_expr_samples = data.frame(t(gbm_expr_top2k))[c(1:60), ]

# keep track of our results
sims = matrix(, nrow = iter, ncol = k + 1)

for (i in 1:iter) {
  # make our dataframe
  cox_data_iter = data.frame(row.names = kmc_data$X,
                      "time" = kmc_data$days.to.death,
                      "death" = kmc_data$vital.status,
                      "type" = kmc_data$type)
  
  # get random gene
  genes_iter = sample(colnames(gbm_expr_samples), 3)
  
  # update cox data 
  cox_data_iter = merge(cox_data_iter, 
                        gbm_expr_samples[,genes_iter],
                        by = 0)[, -1]
  
  # compute model
  coxreg_iter = coxph(Surv(time, death) ~ ., data = cox_data_iter)
  
  # store results
  sims[i, 1:k] = genes_iter
  sims[i, k+1] = extractAIC(coxreg_iter)[2]
}
# Visualization and Analysis of the results
sims = data.frame(sims)
colnames(sims) = c("Gene1", "Gene2", "Gene3", "AIC")
sims$AIC = as.numeric(sims$AIC)

# histogram
hist(sims$AIC, xlab = "AIC",
     main = "Cox Models with Random Genes",
     xlim=c(225,240))
abline(v = extractAIC(coxreg)[2], col = "red", lwd = 3, lty = 2)

# print genes with best AICs from the random process
head(sims[order(sims$AIC),], 5)
##       Gene1 Gene2    Gene3      AIC
## 6   CAMK2N1  SRGN C13orf18 229.4004
## 12   FAM46A RAB32    IL1R2 229.6365
## 72     ACN9  IL33      NES 230.8928
## 50     BAI2 CXCL5     ODZ4 230.9201
## 78 C1orf103  OAS1     SOD2 231.2173

It seems even with limiting our sample to the top 2000 variable genes, we don’t find a random model that does better - in fact in this simulation run, none of the models generated with random signatures got as close compared to when we consider all possible genes. Thus we can conclude that in this particular case, using the LASSO with differentially expressed genes was beneficial as we weren’t able to find any other cox model that performed better.

Part III – Tumor mutation analyses and precision medicine

Q1. The MAF files contain the mutations of each tumor compared to the normal DNA in the patient blood. Write a script to parse out the mutations present in each tumor sample and write out a table. The table should rank the genes by how many times mutation happens in the tumor samples provided. Submit the table with the top 20 genes.

Answer: First I downloaded from Cannon. Then I used dplyr to get a table with the top 20 genes (provided below). The top gene was TP53 with 16 protein changes, followed by IDH1 with 12 protein changes.

# get the files 
scp -r stat215u2107@login.rc.fas.harvard.edu:/n/stat115/2021/HW6/data/MAF part3/
# bulk get file paths
maf_paths <- list.files(path = "part3/maf", 
                      pattern = "*.txt", 
                      full.names = TRUE, 
                      recursive = FALSE)

# load data
maf_data = ldply(maf_paths, read.delim)

# find number of protein change per gene
maf_gene_changes = maf_data %>% 
  group_by(Hugo_Symbol) %>% 
  dplyr::summarise(Count = n()) %>% 
  arrange(dplyr::desc(Count))

# list top 20 genes
maf_gene_changes[1:20, ]
## # A tibble: 20 x 2
##    Hugo_Symbol Count
##    <chr>       <int>
##  1 TP53           16
##  2 IDH1           12
##  3 ATRX            9
##  4 NF1             7
##  5 EGFR            6
##  6 TTN             6
##  7 Unknown         6
##  8 NBPF10          5
##  9 APOB            4
## 10 MUC17           4
## 11 PIK3CA          4
## 12 PIK3R1          4
## 13 PTEN            4
## 14 ABCB4           3
## 15 AHNAK2          3
## 16 CACNA1S         3
## 17 CCDC129         3
## 18 DLG5            3
## 19 DMBT1           3
## 20 GFRAL           3

Q2. Existing clinical genetic testing laboratories use information about the frequency of a mutation in cohorts, like from the GBM cohort in TCGA, to assess a mutation’s clinical significance (guidelines: https://www.ncbi.nlm.nih.gov/pubmed/27993330). Of the top 20 mutated genes in Q1, what is the most frequent protein change (hint: count mutations with the exact same amino acid change as the same)? In what gene does the most frequent protein change occur? Do you think this gene and protein mutation pair forms a genetic subtype of GBM (i.e. Is this gene-protein mutation pair seen more in one GBM subtype than the other)?

Answer: We will look at gene-protein pairs by concatenating those two columns (Hugo_Symbol and Protein_Change) as per lab instruction. Jack Kang on slack said to count all gene-protein pairs. Below are the results across all subtypes and then for each individual subtype.

# most common gene_protein sequence across both subtypes
maf_gene_protein_both = maf_data %>% 
  group_by(Hugo_Symbol, Protein_Change) %>% 
  dplyr::summarise(Count = n()) %>% 
  arrange(dplyr::desc(Count))
## `summarise()` has grouped output by 'Hugo_Symbol'. You can override using the `.groups` argument.
maf_gene_protein_both[1:20, ]
## # A tibble: 20 x 3
## # Groups:   Hugo_Symbol [16]
##    Hugo_Symbol Protein_Change Count
##    <chr>       <chr>          <int>
##  1 IDH1        p.R132H           12
##  2 HSD17B7P2   p.N175S            2
##  3 KLRC3       p.V11L             2
##  4 PARG        p.A584T            2
##  5 PIK3CA      p.E545K            2
##  6 PRKCD       p.G432fs           2
##  7 RRN3P1      p.L36L             2
##  8 TP53        p.R175H            2
##  9 TP53        p.R248Q            2
## 10 TP53        p.Y220C            2
## 11 A2M         p.S593S            1
## 12 AAAS        p.A114A            1
## 13 AADACL4     p.R264H            1
## 14 ABCA9       p.S991G            1
## 15 ABCB1       p.R1225H           1
## 16 ABCB4       p.A6T              1
## 17 ABCB4       p.T560T            1
## 18 ABCB4       p.Y365H            1
## 19 ABCC5       p.G252A            1
## 20 ABCC9       p.P1307P           1

Thus the most frequent protein change (considering both subtypes) is p.R132H on gene IDHI. This gene IDHI is the one that has the most frequent protein change occurring.

Now let’s stratify based on subtypes. We need to match on the cluster types from above.

# add label indicating their source sample
maf_data$sample = str_replace_all(
  substr(maf_data$Tumor_Sample_Barcode, 1, 12),
  "-", ".")

# add cluster label to the data
maf_data$type = tumor_clusters[maf_data$sample, 'type']

# subtype 0 MAF data
maf_data0 = maf_data[maf_data$type == 0, ]
  
# subtype 1 MAF data
maf_data1 = maf_data[maf_data$type == 1, ]
# most common gene_protein sequence across for subtype 0
maf_gene_protein_0 = maf_data0 %>% 
  group_by(Hugo_Symbol, Protein_Change) %>% 
  dplyr::summarise(Count = n()) %>% 
  arrange(dplyr::desc(Count))
## `summarise()` has grouped output by 'Hugo_Symbol'. You can override using the `.groups` argument.
maf_gene_protein_0[1:20, ]
## # A tibble: 20 x 3
## # Groups:   Hugo_Symbol [19]
##    Hugo_Symbol Protein_Change Count
##    <chr>       <chr>          <int>
##  1 IDH1        p.R132H           12
##  2 PARG        p.A584T            2
##  3 PRKCD       p.G432fs           2
##  4 TP53        p.R175H            2
##  5 TP53        p.Y220C            2
##  6 AADACL4     p.R264H            1
##  7 ABCA9       p.S991G            1
##  8 ABCB1       p.R1225H           1
##  9 ABCB4       p.T560T            1
## 10 ABCC5       p.G252A            1
## 11 ABCC9       p.P1307P           1
## 12 ABCF2       p.Q357*            1
## 13 ACACA       p.S1204del         1
## 14 ACADVL      p.A213T            1
## 15 ACE2        p.V581I            1
## 16 ACIN1       p.G1071R           1
## 17 ACOT12      p.T211A            1
## 18 ACOT8       p.R110H            1
## 19 ACVR1       p.A233A            1
## 20 ACVR1C      p.A4T              1
# most common gene_protein sequence across for subtype 1
maf_gene_protein_1 = maf_data1 %>% 
  group_by(Hugo_Symbol, Protein_Change) %>% 
  dplyr::summarise(Count = n()) %>% 
  arrange(dplyr::desc(Count))
## `summarise()` has grouped output by 'Hugo_Symbol'. You can override using the `.groups` argument.
maf_gene_protein_1[1:20, ]
## # A tibble: 20 x 3
## # Groups:   Hugo_Symbol [19]
##    Hugo_Symbol Protein_Change Count
##    <chr>       <chr>          <int>
##  1 HSD17B7P2   p.N175S            2
##  2 KLRC3       p.V11L             2
##  3 A2M         p.S593S            1
##  4 AAAS        p.A114A            1
##  5 ABCB4       p.A6T              1
##  6 ABCB4       p.Y365H            1
##  7 ABCC9       p.R1186Q           1
##  8 ABCD2       p.G79A             1
##  9 ABCG2       p.N76N             1
## 10 ABL1        p.A969T            1
## 11 ACAD11      p.T133R            1
## 12 ACADS       p.R330C            1
## 13 ACPL2       p.E145*            1
## 14 ACSM2A      p.E490V            1
## 15 ACTL9       p.V380I            1
## 16 ADAD1       p.F237L            1
## 17 ADAM20      p.H343H            1
## 18 ADAM22      p.G448S            1
## 19 ADAM28      p.T297_splice      1
## 20 ADAM29      p.D85D             1

Stratifying based on the tumor type, we see that the IDH1 p.R132H gene protein mutation is specific to one type (namely type 0 here) as the other type doesn’t not include it in it’s top 20 common gene-protein mutation pairs. Thus this gene-protein mutation pair is seen more in one GBM subtype than the other as demonstrated by the above tables. This is a useful tool in uncovering different characteristics of different types of tumors.

Q3. CBioPortal has a comprehensive list of tumor profiling results for interactive visualization. Go to cBioPortal (http://www.cbioportal.org), and select either “Glioblastoma” under “CNS/Brian” (left) or select “TCGA PanCancer Atlas Studies” under “Quick Select” (middle). Input each gene in Q1 and click Submit. From the OncoPrint tab, you can see how often each gene is mutated in GBM or all TCGA cancer types. Based on this, which of the genes in Q1 is likely to be a cancer driver gene?

Answer: I provide a screen shot below. The position of mutation determines whether a gene is likely to be a driver gene. Specifically, we are looking for genes that have many mutations or have mutations across a large section of their gene body. Based on cBioPortal, I find TP53, TTN, PIK3CA, and PTEN to be genes likely to be a cancer driver gene. These genes have both a large number of permutations (their percentage value were double digits) and their permutations are wide spread across their gene bodies. Interestingly, I saw genes with permutations that were wide spread (those on the both of the screen shot for example), but the permutations affected a relatively small percentage of their gene body, hence I didn’t determine them as cancer drive cells.

# get list of HUGO symbols from Q1 analysis
write.csv(maf_gene_changes$Hugo_Symbol[1:20], file = "part3/top20genes.txt",
          row.names = FALSE, quote = FALSE)

cBioPortal Results

Q4. From the Mutation tab on the cBioPortal result page, is this mutation a gain or loss of function mutation on the gene you identified from Q2?

Answer: The gene-protein mutation I identified from Q2 was IDH1 p.R132H protein mutation. Gain of function is when the mutation happens frequently at a specific position. Loss of function is when the mutation happens across many spots with low frequency. From this, IDH1 p.R132H is a gain of function because mutation occurs 480 times and all in similar spot (indicated by the dot with the long bar in the screen shot).

cBioPortal Mutation Tab for IDH1

Q5. From cBioPortal, select Glioblastoma (TCGA provisional, which has the largest number of samples) and enter the driver mutation gene in Q2. From the Survival tab, do GBM patients with this mutation have better outcome in terms of progression free survival and overall survival?

Answer: I couldn’t find Glioblastoma (TCGA provisional) so instead I chose the Glioblastoma study with the largest number of sample (as per Jack’s instruction on Slack). That was study: Glioblastoma Multiforme (TCGA, Firehose Legacy) with 604 samples. I query with IDH1 gene from Q2. The screenshot below has the survival curve. I see the logrank test p-value is significant at \(p < 0.05\), thus the difference is significant. The blue curve is for unaltered whereas the red curve is for altered. Since we observe the red curve as higher than the blue (for the months 0-50), we can conclude GBM patients with this mutation have better outcome in terms of progression free survival and overall survival.

cBioPortal Mutation Tab for IDH1

Q6. You are working with an oncologist collaborator to decide the treatment option for a GBM patient. From exome-seq of the tumor, you identified the top mutation in Q2. To find out whether there are drugs that can target this mutation to treat the cancer, go to https://www.clinicaltrials.gov to find clinical trials that target the gene in Q2. How many trials are related to glioblastoma? How many of these are actively recruiting patients which this patient could potentially join? Hint: Search by the disease name and gene name.

Answer: The disease is Glioblastoma and the gene is IDH1. That is what I searched for on the Clincal Trials website. The tool provide total number of matches for my searches as well as the ability to filter based on recruiting and type of study (clinical). Hence I could just use the portal to find the numbers.

I found 16 total trials with 8 actively recruiting patients which this patient could potentially join.

Total Trials

Actively Recruiting

Part IV- CRISPR screens

We will learn to analyze CRISPR screen data from this paper: https://www.ncbi.nlm.nih.gov/pubmed/?term=26673326. To identify therapeutic targets for glioblastoma (GBM), the author performed genome-wide CRISPR-Cas9 knockout (KO) screens in patient-derived GBM stem-like cell line (GSCs0131).

MAGeCK tutorial: https://sourceforge.net/p/mageck/wiki/Home/ https://sourceforge.net/projects/mageck/

Q1. Use MAGeCK to do a basic QC of the CRISPR screen data (e.g. read mapping, ribosomal gene selection, replicate consistency, etc). Comment on the quality of the data based on your results.

Directory: /n/stat115/2021/HW6/data/crispr_data

# making my batch file (based on lab specifications): this is for downstream analysis
cat ->qc.sh <<EOF
#!/bin/bash
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -t 0-06:00 
#SBATCH -p serial_requeue 
#SBATCH --mem=1000 
#SBATCH -o mageck.out 
#SBATCH -e mageck.err 
#SBATCH --mail-type=ALL
#SBATCH --mail-user="nahmed@college.harvard.edu"

module load Anaconda/5.0.1-fasrc01

source activate /n/stat115/2021/HW6_env2

mageck count -l /n/stat115/2021/HW6/data/crispr_data/library.csv -n OUT --sample-label Day0,Day23 \
--fastq /n/stat115/2021/HW6/data/crispr_data/GSC_0131_Day0_Rep1.fastq.gz,/n/stat115/2021/HW6/data/crispr_data/GSC_0131_Day0_Rep2.fastq.gz /n/stat115/2021/HW6/data/crispr_data/GSC_0131_Day23_Rep1.fastq.gz,/n/stat115/2021/HW6/data/crispr_data/GSC_0131_Day23_Rep2.fastq.gz

EOF

sbatch qc.sh

# get output summary file
mageck test -k OUT.count.txt -t Day23 -c Day0 -n OUT

# bring file to local machine
scp stat215u2107@login.rc.fas.harvard.edu:OUT.countsummary.txt part4/countsummary.txt
scp stat215u2107@login.rc.fas.harvard.edu:OUT.gene_summary.txt part4/gene_summary.txt


# making my batch file (based on lab specifications): this is for replicate analysis
cat ->replicates.sh <<EOF
#!/bin/bash
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -t 0-06:00 
#SBATCH -p serial_requeue 
#SBATCH --mem=1000 
#SBATCH -o mageck.out 
#SBATCH -e mageck.err 
#SBATCH --mail-type=ALL
#SBATCH --mail-user="nahmed@college.harvard.edu"

module load Anaconda/5.0.1-fasrc01

source activate /n/stat115/2021/HW6_env2

mageck count -l /n/stat115/2021/HW6/data/crispr_data/library.csv -n OUT --sample-label Day0_Rep1,Day0_Rep2,Day23_Rep1,Day23_Rep2 \
--fastq /n/stat115/2021/HW6/data/crispr_data/GSC_0131_Day0_Rep1.fastq.gz /n/stat115/2021/HW6/data/crispr_data/GSC_0131_Day0_Rep2.fastq.gz /n/stat115/2021/HW6/data/crispr_data/GSC_0131_Day23_Rep1.fastq.gz /n/stat115/2021/HW6/data/crispr_data/GSC_0131_Day23_Rep2.fastq.gz

EOF

sbatch replicates.sh

scp stat215u2107@login.rc.fas.harvard.edu:OUT.count_normalized.txt part4/count_normalized.txt
#  QC check
countsummary = read.csv("part4/countsummary.txt", sep = "\t")
countsummary
##                                                                File Label
## 1 /n/stat115/2021/HW6/data/crispr_data/GSC_0131_Day23_Rep1.fastq.gz Day23
## 2 /n/stat115/2021/HW6/data/crispr_data/GSC_0131_Day23_Rep2.fastq.gz Day23
## 3  /n/stat115/2021/HW6/data/crispr_data/GSC_0131_Day0_Rep2.fastq.gz  Day0
## 4  /n/stat115/2021/HW6/data/crispr_data/GSC_0131_Day0_Rep1.fastq.gz  Day0
##      Reads   Mapped Percentage TotalsgRNAs Zerocounts GiniIndex NegSelQC
## 1 62818064 39992777     0.6366       64076         57   0.08510        0
## 2 58686580 37836392     0.6447       64076         51   0.08587        0
## 3 47289074 31709075     0.6705       64076         17   0.07496        0
## 4 51190401 34729858     0.6784       64076         14   0.07335        0
##   NegSelQCPval NegSelQCPvalPermutation NegSelQCPvalPermutationFDR NegSelQCGene
## 1            1                       1                          1            0
## 2            1                       1                          1            0
## 3            1                       1                          1            0
## 4            1                       1                          1            0

Answer: Made a bash script to complete the quality control. From lab, good quality usually have the follow metrics: percentage of mapped reads is above 0.6, zero count is less than 0.1, and gini index is less than 0.1.

From the quality control report from mageck, I found that the percentage of mapped reads is above 0.6 for all fastq files, the zero counts are more than 1000 magnitudes smaller so it less than 0.1, and the gini indices are all smaller than 0.1. Thus these fastq files have good quality based on these metrics/guidance.

Now I’ll inspect the ribosomal genes from the gene_summary.txt files (I’ll use grep looking for genes that start with RP). Since ribosomal genes are vital for survival, they should be ranked high in the gene summary file.

# ribosomal gene check
gene_summary = read.csv("part4/gene_summary.txt", sep = "\t")
RP_genes = subset(gene_summary, grepl("^RP", gene_summary$id) == TRUE)
hist(RP_genes$neg.rank, xlab = "ranks",
     main = "Ranks of Ribosomal Genes")

The gene summary file has 16,647 different genes. The histogram for the ranks for ribosomal genes are left skewed, which means that more of these genes have high rank, which is what we what we should expect. Thus the fastq files also pass quality in terms of ribosomal genes.

Lastly, we’ll look at the replicate consistency by using a heatmap on count_normalized.txt. We should expected the two files for Day 0 are highly correlated with each other. We should also expected the two files for Day 23 are highly correlated with each other. We’ll need to re-run mageck with all four files separately to perform this analysis. Below are the results.

# replicate consistency check
count_normalized = read.csv("part4/count_normalized.txt", sep = "\t")
# plot heatmap
# heatmap(count_normalized[, c[2:6]])

From the heatmap, I see that the Day 0 fastq files are highly correlated with each other. Similarly for the Day 23 fastq files

Q2. Analyze CRISPR screen data with MAGeCK to identify positive and negative selection genes. How many genes are selected as positive or negative selection genes, respectively, and what are their respective enriched pathways?

Answer: We can read in gene_summary.txt file using ReadRRA() command. This gives use FDR scores which we can filter on - we want FDR < 0.05 as those are significant. Then we can use the score to find the positive and negative genes. I found 8 positive selection genes (at FDR < 0.05) and 368 negative selection genes (at FDR < 0.05).

# get score and FDR
mageck_genes = ReadRRA("part4/gene_summary.txt")

# filter for significant gene
mageck_genes_sig = mageck_genes[mageck_genes$FDR < 0.05, ]

# get positive and negative genes
mageck_genes_pos = mageck_genes_sig[mageck_genes_sig$Score > 0, ]
mageck_genes_neg = mageck_genes_sig[mageck_genes_sig$Score < 0, ]

# get the counts
nrow(mageck_genes_pos)
## [1] 8
nrow(mageck_genes_neg)
## [1] 368
# store the genes in text file for DAVID
write.csv(mageck_genes_pos$id, file = "part4/positive_genes.txt",
          row.names = FALSE, quote = FALSE)
write.csv(mageck_genes_neg$id, file = "part4/negative_genes.txt",
          row.names = FALSE, quote = FALSE)

Now I’ll run DAVID on these genes with identifier of Official_Gene_Symbol (screen shots provided below).

For positive genes:

Positive Gene

For negative genes:

Negative Gene

I see that for diseases, both negative and positive selection genes have connection with OMIM_DISEASE (which is neurological issues with vanishing white matter). They also seem to share similar gene ontology. However for negative selection gene, they are part of a functional category of COG_ONTOLOGY which positive selection genes aren’t.

Q3. For Graduate Students: Genes negatively selected in this CRISPR screen could be potential drug targets. However, if they are always negatively selected in many cells, targeting such genes might create too much toxicity to the normal cells. Go to depmap (DepMap.org) which has CRISPR / RNAi screens of over 500 human cell lines, Click “Tools”  Data Explorer. Pick the top 3 negatively selected genes to explore. Select Gene Dependency from CRISPR (Avana) on the X axis and Omics from Expression on the Y axis, to see the relationship between the expression level of the gene and dependency (CRISPR screen selection) of the gene across ~500 cell lines. Are the top 3 genes good drug targets?

Answer: Negative selection genes are vital for GBM cancer cell survival and can potentially be drug targets. However we also need to worry about whether they are negatively selected for normal cells (potential collateral damage). We can address this by looking at the relationship between expression and dependency across many cell lines. If we observe that expression is highly correlated with dependency, this means that the alteration of the gene expression will affect the survival of many cells, and we should not use this as a drug target. We want those with low correlation!

Given the tool might not have the exact gene names, I will also use alias from other databases to help the tool out. I selected the top 3 negatively genes by first FDR and then score, thus getting the these genes: RPL, ASH2L, and LAS1L.

# get top 3 negatively selected genes (by FDR, and then score)
head(mageck_genes_neg[order(mageck_genes_neg$FDR, 
                            mageck_genes_neg$Score), ], 5)
##       id   Score      FDR
## 9    RPL -2.6594 0.000291
## 31 ASH2L -2.5181 0.000291
## 4  LAS1L -2.2766 0.000291
## 18  XRN1 -2.2725 0.000291
## 15 MRPL9 -2.0155 0.000291

The results for RPL are below. We achieve a pearson \(r^2\) of 0.180 and spearman \(r^2\) of 0.156. The correlation is moderate, thus we can say that there is some relationship between expression and dependency. Thus I would not recommended for drug targeting.

RPL Graph

The results for ASH2L are below. We achieve a pearson \(r^2\) of 0.394 and spearman \(r^2\) of 0.405. The correlation is much bigger than the correlation for RPL above. I would say that expression and dependency is again moderate for this gene, so I would not recommend this gene as a drug target.

ASH2L Graph

The results for LAS1L are below. We achieve a pearson \(r^2\) of 0.216 and spearman \(r^2\) of 0.210. The correlation is moderate, thus we can say that there is some relationship between expression and dependency. Thus I would not recommended for drug targeting.

LAS1L Graph

In summary, by looking at the top 3 negatively selected genes, VARS2, PSMA, and SNAPC2, we found that all had moderate amount of correlation according to the depmap tool - thus not making them great candidates for drug targeting.

Initially, I did this problem wrong by sorting by score and not FDR. When I plot those genes ranked by score, I saw the correlations were super low (less than 0.1) - I felt genes with that type of correlation would be ideal for drug targeting.

Q4. For Graduate Students: Let’s filter out pan essential genes (PanEssential.txt) from the negatively selected genes in Q2. Take the remaining top 10 genes, and check whether those genes have drugs or are druggable from this website: http://www.oasis-genomics.org/. Go to Analysis -> Pan Cancer Report, enter the top 10 genes and check the table for druggability (more druggable for higher number on Dr). Which of these genes are druggable?

Note: If the above website cannot be opened, try https://www.dgidb.org/. Go to Search Potential Druggability or Search Drug-Gene Interactions and enter the top 10 genes.

Answer: First I get the PanEssential file. Then I filter those genes from negatively selected genes. Then I rank based first on FDR and then score, and get the top 10.

# get the files 
scp stat215u2107@login.rc.fas.harvard.edu:/n/stat115/2021/HW6/data/PanEssential.txt part4/PanEssential.txt
# load data
PanEssential = read.csv("part4/PanEssential.txt", sep = "\t")

# filter genes
filter_genes = mageck_genes_neg[!(mageck_genes_neg$id %in% PanEssential$GeneSymbol), ]

# sort and order
head(filter_genes[order(filter_genes$FDR, 
                        filter_genes$Score), ], 10)
##        id   Score      FDR
## 9     RPL -2.6594 0.000291
## 31  ASH2L -2.5181 0.000291
## 18   XRN1 -2.2725 0.000291
## 15  MRPL9 -2.0155 0.000291
## 2  SRSF10 -1.8816 0.000291
## 7   TCOF1 -1.8747 0.000291
## 6  ATP6V0 -1.6931 0.000291
## 13   PMVK -1.3809 0.000291
## 17    HMB -1.3727 0.000291
## 1     NF2 -1.3624 0.000291
# write to txt file
write.csv(head(filter_genes[order(filter_genes$FDR, 
                        filter_genes$Score), ], 10)$id, 
          file = "part4/drug_gene.txt",
          row.names = FALSE, quote = FALSE)

The first website doesn’t work, so I’m using the second one. I attach screen shots of my results. It seems the genes that got matched were ASH2L, TCOF1, PMVK, and NF2. These genes had more information about druggability, thuse I would say 4 genes are druggable. As for a druggable score, I couldn’t find that option on the second website.

Drug Database Results

Part V. Cancer immunology and immunotherapy

Immune checkpoint inhibitors, which primarily activate CD8 T cells, have shown remarkable efficacy in melanoma (SKCM), but haven’t worked as well in GBM patients. Let’s explore the tumor immune microenvironment from TCGA data. Although the cancer patients in TCGA were not treated with immunotherapy, their response to other drugs and clinical outcome might be influenced by pre-treatment tumor immune microenvironment

Q1. TIMER (http://timer.cistrome.org/) estimated the infiltration level of different immune cells of TCGA tumors using different immune deconvolution methods. CD8A and CD8B are two gene markers on CD8 T cells. On the Diff Exp tab, compare the expression level of either CD8A or CD8B between GBM and SKCM (Metastatic Melanoma). Based on this, which cancer type have more CD8 T cells?

Answer: Screen shots provide below.

CD8A

Looking at CD8A and comparing GBM vs SKCM, we see that SKCM has more expression of CD8A (as well as a greater range of possible expression values). Also the expression of CD8A for SKCM is statistical significant indicated by asterisk.

CD8B

Similarly looking at CD8B and comparing GBM vs SKCM, we see that SKCM has more expression of CD8B (as well as a greater range of possible expression values). Also the expression of CD8B for SKCM is statistical significant indicated by asterisk.

Thus we can conclude that Metastatic Melanoma has more CD8 T cells.

Q2. On the Gene tab, select both GBM and SKCM (Metastatic Melanoma), include CD8 T cells as the cell infiltrate. Check the following genes, PDCD1(PD1), CD274(PDL1), CTLA4 which are the targets of immune checkpoint inhibitors, to see whether their expression level is associated with immune cell infiltration in the GBM and SKCM tumors. Their higher expression usually indicate that T cells are in a dysfunctional state, which immune checkpoint inhibitors aim to revive.

Answer: Screen shots provide below.

PDCD1

For PDCD1, we see the GBM row is lightly colored (indicating it has weak correlation with PDCD1) and several x-marks which means it doesn’t get statistical significance for those categories. Whereas, SKCM is darkly colored red (indicating it has strong positive correlation with PDCD1) and no x-marks which means it is statistical significance for all those categories.

CD274

For CD274, we a similar trend as before. GBM has weak and slightly negative correlation, with main categories not being statistically significant, whereas SKCM is mostly colored red (indicating it has strong positive correlation with CD274) and achieving statistical significance for all categories.

CTLA4

For CTLA4, we a similar trend as before. GBM has positive correlation with CTLA4, but it is weaker than SKCM’s correlation. Similarly, SKCM is statistically significant on more categories than GBM.

In conclusion, we find that these genes expression level is associated with higher immune cell infiltration for SKCM tumors than for GBM (from the above graphs).

Q3. On the Survival tab, select both GBM and SKCM, include CD8 T cell as the cell infiltrate, add tumor stage and patient age as the clinical variables to conduct survival analyses. Based on the Cox PH model, what factors are the most significantly associated with patient survival in each cancer type? Plot the Kaplan-Meier curve to evaluate how the immune cell infiltrate is associated with survival. Is CD8 T cell associated with patient survival in each cancer type?

Answer: Screen shot attached below.

Outcomes

We see high negative z-scores for SKCM compare to GBM. Now’s look at the cox models below:

COX: GBM

COX: SKCM

We see that age is significant for both GBM and SKCM. However for SKCM, stage also matters as we see that appear in the resulting Cox Model (particularly Stage 4 and Stage 3). Both models are significant given their summary statistics.

Q4. For Graduate Students: Based on the above observations, can you hypothesize why immune checkpoint inhibitors don’t work well for some GBM patients?

Answer: The reason that immune checkpoint inhibitors don’t work well for some GBM patients is because 1) CD8A and CD8B isn’t as present, as seen in Q1. Moreover, the expression from the genes replace for these inhibitors is not really associated with higher immune cell infiltration (as we saw in Q2 with the graphs). Lastly, I image that GBM has to do a lot with age and our body naturally become more weathered as we get older. For these reasons, we see that immune checkpoint inhibitors don’t work well for some GBM patients.

Rules for submitting the homework:

Please submit your solution directly on the canvas website. Please provide both your code in this Rmd document and an html file for your final write-up. Please pay attention to the clarity and cleanness of your homework.

The teaching fellows will grade your homework and give the grades with feedback through canvas. Some of the questions might not have a unique or optimal solution. TFs will grade those according to your creativity and effort on exploration, especially in the graduate-level questions.